{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 10.3 XGBoost算法案例实战2 - 信用评分模型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**10.3.1 案例背景**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为了降低不良贷款率，保障自身资金安全，提高风险控制水平，银行等金融机构会根据客户的信用历史资料构建信用评分模型给客户评分。根据客户的信用得分，可以估计客户按时还款的可能，并据此决定是否发放贷款及贷款的额度和利率。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**10.3.2 多元线性回归模型**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1.读取数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>月收入</th>\n",
       "      <th>年龄</th>\n",
       "      <th>性别</th>\n",
       "      <th>历史授信额度</th>\n",
       "      <th>历史违约次数</th>\n",
       "      <th>信用评分</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>7783</td>\n",
       "      <td>29</td>\n",
       "      <td>0</td>\n",
       "      <td>32274</td>\n",
       "      <td>3</td>\n",
       "      <td>73</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>7836</td>\n",
       "      <td>40</td>\n",
       "      <td>1</td>\n",
       "      <td>6681</td>\n",
       "      <td>4</td>\n",
       "      <td>72</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>6398</td>\n",
       "      <td>25</td>\n",
       "      <td>0</td>\n",
       "      <td>26038</td>\n",
       "      <td>2</td>\n",
       "      <td>74</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>6483</td>\n",
       "      <td>23</td>\n",
       "      <td>1</td>\n",
       "      <td>24584</td>\n",
       "      <td>4</td>\n",
       "      <td>65</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5167</td>\n",
       "      <td>23</td>\n",
       "      <td>1</td>\n",
       "      <td>6710</td>\n",
       "      <td>3</td>\n",
       "      <td>73</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    月收入  年龄  性别  历史授信额度  历史违约次数  信用评分\n",
       "0  7783  29   0   32274       3    73\n",
       "1  7836  40   1    6681       4    72\n",
       "2  6398  25   0   26038       2    74\n",
       "3  6483  23   1   24584       4    65\n",
       "4  5167  23   1    6710       3    73"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "df = pd.read_excel('信用评分卡模型.xlsx')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2.提取特征变量和目标变量"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 通过如下代码将特征变量和目标变量单独提取出来，代码如下：\n",
    "X = df.drop(columns='信用评分')\n",
    "Y = df['信用评分']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3.模型训练及搭建"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 从Scikit-Learn库中引入LinearRegression()模型进行模型训练，代码如下：\n",
    "from sklearn.linear_model import LinearRegression\n",
    "model = LinearRegression()\n",
    "model.fit(X,Y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "各系数为:[ 5.58658996e-04  1.62842002e-01  2.18430276e-01  6.69996665e-05\n",
      " -1.51063940e+00]\n",
      "常数项系数k0为:67.16686603853304\n"
     ]
    }
   ],
   "source": [
    "# 4.线性回归方程构造\n",
    "print('各系数为:' + str(model.coef_))\n",
    "print('常数项系数k0为:' + str(model.intercept_))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "5.模型评估"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "D:\\Anaconda\\Anaconda\\lib\\site-packages\\numpy\\core\\fromnumeric.py:2495: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.\n",
      "  return ptp(axis=axis, out=out, **kwargs)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table class=\"simpletable\">\n",
       "<caption>OLS Regression Results</caption>\n",
       "<tr>\n",
       "  <th>Dep. Variable:</th>          <td>信用评分</td>       <th>  R-squared:         </th> <td>   0.629</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Model:</th>                   <td>OLS</td>       <th>  Adj. R-squared:    </th> <td>   0.628</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Method:</th>             <td>Least Squares</td>  <th>  F-statistic:       </th> <td>   337.6</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Date:</th>             <td>Sat, 28 Dec 2019</td> <th>  Prob (F-statistic):</th> <td>2.32e-211</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Time:</th>                 <td>12:05:28</td>     <th>  Log-Likelihood:    </th> <td> -2969.8</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>No. Observations:</th>      <td>  1000</td>      <th>  AIC:               </th> <td>   5952.</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Residuals:</th>          <td>   994</td>      <th>  BIC:               </th> <td>   5981.</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Model:</th>              <td>     5</td>      <th>                     </th>     <td> </td>    \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Covariance Type:</th>      <td>nonrobust</td>    <th>                     </th>     <td> </td>    \n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "     <td></td>       <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>const</th>  <td>   67.1669</td> <td>    1.121</td> <td>   59.906</td> <td> 0.000</td> <td>   64.967</td> <td>   69.367</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>月收入</th>    <td>    0.0006</td> <td> 8.29e-05</td> <td>    6.735</td> <td> 0.000</td> <td>    0.000</td> <td>    0.001</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>年龄</th>     <td>    0.1628</td> <td>    0.022</td> <td>    7.420</td> <td> 0.000</td> <td>    0.120</td> <td>    0.206</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>性别</th>     <td>    0.2184</td> <td>    0.299</td> <td>    0.730</td> <td> 0.466</td> <td>   -0.369</td> <td>    0.806</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>历史授信额度</th> <td>   6.7e-05</td> <td> 7.78e-06</td> <td>    8.609</td> <td> 0.000</td> <td> 5.17e-05</td> <td> 8.23e-05</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>历史违约次数</th> <td>   -1.5106</td> <td>    0.140</td> <td>  -10.811</td> <td> 0.000</td> <td>   -1.785</td> <td>   -1.236</td>\n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "  <th>Omnibus:</th>       <td>13.180</td> <th>  Durbin-Watson:     </th> <td>   1.996</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Prob(Omnibus):</th> <td> 0.001</td> <th>  Jarque-Bera (JB):  </th> <td>  12.534</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Skew:</th>          <td>-0.236</td> <th>  Prob(JB):          </th> <td> 0.00190</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Kurtosis:</th>      <td> 2.721</td> <th>  Cond. No.          </th> <td>4.27e+05</td>\n",
       "</tr>\n",
       "</table><br/><br/>Warnings:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.<br/>[2] The condition number is large, 4.27e+05. This might indicate that there are<br/>strong multicollinearity or other numerical problems."
      ],
      "text/plain": [
       "<class 'statsmodels.iolib.summary.Summary'>\n",
       "\"\"\"\n",
       "                            OLS Regression Results                            \n",
       "==============================================================================\n",
       "Dep. Variable:                   信用评分   R-squared:                       0.629\n",
       "Model:                            OLS   Adj. R-squared:                  0.628\n",
       "Method:                 Least Squares   F-statistic:                     337.6\n",
       "Date:                Sat, 28 Dec 2019   Prob (F-statistic):          2.32e-211\n",
       "Time:                        12:05:28   Log-Likelihood:                -2969.8\n",
       "No. Observations:                1000   AIC:                             5952.\n",
       "Df Residuals:                     994   BIC:                             5981.\n",
       "Df Model:                           5                                         \n",
       "Covariance Type:            nonrobust                                         \n",
       "==============================================================================\n",
       "                 coef    std err          t      P>|t|      [0.025      0.975]\n",
       "------------------------------------------------------------------------------\n",
       "const         67.1669      1.121     59.906      0.000      64.967      69.367\n",
       "月收入            0.0006   8.29e-05      6.735      0.000       0.000       0.001\n",
       "年龄             0.1628      0.022      7.420      0.000       0.120       0.206\n",
       "性别             0.2184      0.299      0.730      0.466      -0.369       0.806\n",
       "历史授信额度        6.7e-05   7.78e-06      8.609      0.000    5.17e-05    8.23e-05\n",
       "历史违约次数        -1.5106      0.140    -10.811      0.000      -1.785      -1.236\n",
       "==============================================================================\n",
       "Omnibus:                       13.180   Durbin-Watson:                   1.996\n",
       "Prob(Omnibus):                  0.001   Jarque-Bera (JB):               12.534\n",
       "Skew:                          -0.236   Prob(JB):                      0.00190\n",
       "Kurtosis:                       2.721   Cond. No.                     4.27e+05\n",
       "==============================================================================\n",
       "\n",
       "Warnings:\n",
       "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
       "[2] The condition number is large, 4.27e+05. This might indicate that there are\n",
       "strong multicollinearity or other numerical problems.\n",
       "\"\"\""
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 利用3.2节模型评估的方法对此多元线性回归模型进行评估，代码如下：\n",
    "import statsmodels.api as sm\n",
    "X2 = sm.add_constant(X)\n",
    "est = sm.OLS(Y, X2).fit()\n",
    "est.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以看到模型整体的R-squared为0.629，Adj. R-Squared为0.628，整体拟合效果一般，可能是因为数据量偏少的原因。同时我们再来观察P值，可以发现大部分特征变量的P值都较小（小于0.05），的确是和目标变量：信用评分显著相关，而性别这一特征变量的P值达到了0.466，即与目标变量没有显著相关性，这个也的确符合经验认知，所以在多元线性回归模型中，我们其实可以把性别这一特征变量舍去。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**10.3.3 GBDT回归模型**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 这里使用第九章讲过的GBDT回归模型同样来做一下回归分析，首先读取1000条信用卡客户的数据并划分特征变量和目标变量，这部分代码和上面线性回归的代码是一样的。\n",
    "# 1.读取数据\n",
    "import pandas as pd\n",
    "df = pd.read_excel('信用评分卡模型.xlsx')\n",
    "# 2.提取特征变量和目标变量\n",
    "X = df.drop(columns='信用评分')\n",
    "y = df['信用评分']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3.划分训练集和测试集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 通过如下代码划分训练集和测试集数据：\n",
    "from sklearn.model_selection import train_test_split\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "4.模型训练及搭建"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',\n",
       "                          init=None, learning_rate=0.1, loss='ls', max_depth=3,\n",
       "                          max_features=None, max_leaf_nodes=None,\n",
       "                          min_impurity_decrease=0.0, min_impurity_split=None,\n",
       "                          min_samples_leaf=1, min_samples_split=2,\n",
       "                          min_weight_fraction_leaf=0.0, n_estimators=100,\n",
       "                          n_iter_no_change=None, presort='deprecated',\n",
       "                          random_state=None, subsample=1.0, tol=0.0001,\n",
       "                          validation_fraction=0.1, verbose=0, warm_start=False)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 划分训练集和测试集完成后，就可以从Scikit-Learn库中引入GBDT模型进行模型训练了，代码如下：\n",
    "from sklearn.ensemble import GradientBoostingRegressor\n",
    "model = GradientBoostingRegressor()  # 使用默认参数\n",
    "model.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "5.模型预测及评估"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[70.77631652 71.40032104 73.73465155 84.52533945 71.09188294 84.9327599\n",
      " 73.72232388 83.44560704 82.61221486 84.86927209]\n"
     ]
    }
   ],
   "source": [
    "# 模型搭建完毕后，通过如下代码预测测试集数据：\n",
    "y_pred = model.predict(X_test)\n",
    "print(y_pred[0:10])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>预测值</th>\n",
       "      <th>实际值</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>70.776317</td>\n",
       "      <td>79</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>71.400321</td>\n",
       "      <td>80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>73.734652</td>\n",
       "      <td>62</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>84.525339</td>\n",
       "      <td>89</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>71.091883</td>\n",
       "      <td>80</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         预测值  实际值\n",
       "0  70.776317   79\n",
       "1  71.400321   80\n",
       "2  73.734652   62\n",
       "3  84.525339   89\n",
       "4  71.091883   80"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 通过和之前章节类似的代码，我们可以将预测值和实际值进行对比：\n",
    "a = pd.DataFrame()  # 创建一个空DataFrame \n",
    "a['预测值'] = list(y_pred)\n",
    "a['实际值'] = list(y_test)\n",
    "a.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.6766094959559134\n"
     ]
    }
   ],
   "source": [
    "# 因为GradientBoostingRegressor()是一个回归模型，所以我们通过查看其R-squared值来评判模型的拟合效果：\n",
    "from sklearn.metrics import r2_score\n",
    "r2 = r2_score(y_test, model.predict(X_test))\n",
    "print(r2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "第1行代码从Scikit-Learn库中引入r2_score()函数；第2行代码将训练集的真实值和模型预测值传入r2_score()函数，得出R-squared评分为0.675，可以看到这个结果较线性回归模型获得的0.629是有所改善的。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.6766094959559132"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 我们还可以通过GradientBoostingRegressor()自带的score()函数来查看模型预测的效果：\n",
    "model.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**10.3.4 XGBoost回归模型**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 如下所示，其中前3步读取数据，提取特征变量和目标变量，划分训练集和测试集都与GBDT模型相同，因此不再重复，直接从第四步模型开始讲解：\n",
    "# 1.读取数据\n",
    "import pandas as pd\n",
    "df = pd.read_excel('信用评分卡模型.xlsx')\n",
    "# 2.提取特征变量和目标变量\n",
    "X = df.drop(columns='信用评分')\n",
    "Y = df['信用评分']\n",
    "# 3.划分测试集和训练集\n",
    "from sklearn.model_selection import train_test_split\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "4.模型训练及搭建"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n",
       "             colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,\n",
       "             max_depth=3, min_child_weight=1, missing=None, n_estimators=100,\n",
       "             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,\n",
       "             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,\n",
       "             silent=True, subsample=1)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 划分训练集和测试集完成后，就可以从Scikit-Learn库中引入XGBRegressor()模型进行模型训练了，代码如下：\n",
    "from xgboost import XGBRegressor\n",
    "model = XGBRegressor()  # 使用默认参数\n",
    "model.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "5.模型预测及评估"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[70.00004  70.659004 73.217026 84.60378  72.09019  84.97985  73.81208\n",
      " 83.586205 81.829445 84.82512 ]\n"
     ]
    }
   ],
   "source": [
    "# 模型搭建完毕后，通过如下代码预测测试集数据：\n",
    "y_pred = model.predict(X_test)\n",
    "print(y_pred[0:10])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>预测值</th>\n",
       "      <th>实际值</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>70.000038</td>\n",
       "      <td>79</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>70.659004</td>\n",
       "      <td>80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>73.217026</td>\n",
       "      <td>62</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>84.603783</td>\n",
       "      <td>89</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>72.090187</td>\n",
       "      <td>80</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         预测值  实际值\n",
       "0  70.000038   79\n",
       "1  70.659004   80\n",
       "2  73.217026   62\n",
       "3  84.603783   89\n",
       "4  72.090187   80"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 通过和之前章节类似的代码，我们可以将预测值和实际值进行对比：\n",
    "a = pd.DataFrame()  # 创建一个空DataFrame \n",
    "a['预测值'] = list(y_pred)\n",
    "a['实际值'] = list(y_test)\n",
    "a.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.6783590081364717\n"
     ]
    }
   ],
   "source": [
    "# 因为XGBRegressor()是一个回归模型，所以通过查看R-squared来评判模型的拟合效果：\n",
    "from sklearn.metrics import r2_score\n",
    "r2 = r2_score(y_test, model.predict(X_test))\n",
    "print(r2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.6783590081364717"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 我们还可以通过XGBRegressor()自带的score()函数来查看模型预测的效果：\n",
    "model.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "6.查看特征重要性"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>特征名称</th>\n",
       "      <th>特征重要性</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>历史授信额度</td>\n",
       "      <td>0.389456</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>月收入</td>\n",
       "      <td>0.369048</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>年龄</td>\n",
       "      <td>0.159864</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>历史违约次数</td>\n",
       "      <td>0.066327</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>性别</td>\n",
       "      <td>0.015306</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     特征名称     特征重要性\n",
       "3  历史授信额度  0.389456\n",
       "0     月收入  0.369048\n",
       "1      年龄  0.159864\n",
       "4  历史违约次数  0.066327\n",
       "2      性别  0.015306"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 通过10.2.3节讲过的feature_importances_属性，我们来查看模型的特征重要性：\n",
    "features = X.columns  # 获取特征名称\n",
    "importances = model.feature_importances_  # 获取特征重要性\n",
    "\n",
    "# 通过二维表格形式显示\n",
    "importances_df = pd.DataFrame()\n",
    "importances_df['特征名称'] = features\n",
    "importances_df['特征重要性'] = importances\n",
    "importances_df.sort_values('特征重要性', ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**补充知识点1：XGBoost回归模型的参数调优**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 通过和10.2.4节类似的代码，我们可以对XGBoost回归模型进行参数调优，代码如下：\n",
    "from sklearn.model_selection import GridSearchCV  \n",
    "parameters = {'max_depth': [1, 3, 5], 'n_estimators': [50, 100, 150], 'learning_rate': [0.01, 0.05, 0.1, 0.2]}  # 指定模型中参数的范围\n",
    "clf = XGBRegressor()  # 构建回归模型\n",
    "grid_search = GridSearchCV(model, parameters, scoring='r2', cv=5) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里唯一需要注意的是最后一行代码中的scoring参数需要设置成'r2'，其表示的是R-squared值，因为是回归模型，所以参数调优时应该选择R-squared值来进行评判，而不是分类模型中常用的准确度'accuracy'或者ROC曲线对应的AUC值'roc_auc'。\n",
    "通过如下代码获取最优参数："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 50}"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grid_search.fit(X_train, y_train)  # 传入数据\n",
    "grid_search.best_params_  # 输出参数的最优值"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "获得最优参数如下所示：\n",
    "{'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 50}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n",
       "             colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,\n",
       "             max_depth=3, min_child_weight=1, missing=None, n_estimators=50,\n",
       "             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,\n",
       "             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,\n",
       "             silent=True, subsample=1)"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 在模型中设置参数，代码如下：\n",
    "model = XGBRegressor(max_depth=3, n_estimators=50, learning_rate=0.1)\n",
    "model.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.688448571624294\n"
     ]
    }
   ],
   "source": [
    "# 此时再通过r2_score()函数进行模型评估，代码如下（也可以用model.score(X_test, y_test)进行评分，效果一样）：\n",
    "from sklearn.metrics import r2_score\n",
    "r2 = r2_score(y_test, model.predict(X_test))\n",
    "print(r2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "此时获得的R-squared值如下所示：\n",
    "0.688。\n",
    "可以看到调参后的R-squared值优于未调参前的R-squared值0.678。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**补充知识点2：对于XGBoost模型，有必要做很多数据预处理吗？**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在传统的机器模型中，我们往往需要做挺多的数据预处理，例如数据的归一化、缺失值及异常值的处理等（这些会在第十一章着重讲解），但是对于XGBoost模型而言，很多预处理都是不需要的，例如对于缺失值而言，XGBoost模型会自动处理，它会通过枚举所有缺失值在当前节点是进入左子树还是右子树来决定缺失值的处理方式。\n",
    "\n",
    "此外由于XGBoost是基于决策树模型，因此区别于线性回归等模型，像一些特征变换（例如离散化、归一化或者叫作标准化、取log、共线性问题处理等）都不太需要，这也是树模型的一个优点。如果有的读者还不太放心，可以自己尝试下做一下特征变换，例如数据归一化，会发现最终的结果都是一样的。这里给大家简单示范一下，通过如下代码对数据进行Z-score标准化或者叫作归一化(这部分内容也会在11.3节进行讲解)："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[-0.88269208, -1.04890243, -1.01409939, -0.60873764,  0.63591822],\n",
       "       [-0.86319167,  0.09630122,  0.98609664, -1.55243002,  1.27956013],\n",
       "       [-1.39227834, -1.46534013, -1.01409939, -0.83867808, -0.0077237 ],\n",
       "       ...,\n",
       "       [ 1.44337605,  0.61684833,  0.98609664,  1.01172301, -0.0077237 ],\n",
       "       [ 0.63723633, -0.21602705,  0.98609664, -0.32732239, -0.0077237 ],\n",
       "       [ 1.57656755,  0.61684833, -1.01409939,  1.30047599, -0.0077237 ]])"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "X_new = StandardScaler().fit_transform(X)\n",
    "\n",
    "X_new  # 打印标准化后的数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "利用标准化后的数据进行建模，看看是否有差别："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.6783590081364717\n"
     ]
    }
   ],
   "source": [
    "# 3.划分测试集和训练集\n",
    "from sklearn.model_selection import train_test_split\n",
    "X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2, random_state=123)\n",
    "\n",
    "# 4.建模\n",
    "# 划分训练集和测试集完成后，就可以从Scikit-Learn库中引入XGBRegressor()模型进行模型训练了，代码如下：\n",
    "from xgboost import XGBRegressor\n",
    "model = XGBRegressor()  # 使用默认参数\n",
    "model.fit(X_train, y_train)\n",
    "\n",
    "# 因为XGBRegressor()是一个回归模型，所以通过查看R-squared来评判模型的拟合效果：\n",
    "from sklearn.metrics import r2_score\n",
    "r2 = r2_score(y_test, model.predict(X_test))\n",
    "print(r2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "此时再对这个X_new通过train_test_split()函数划分测试集和训练集，并进行模型的训练，最后通过r2_score()获得模型评分，会发现结果和没有归一化的数据的结果一模一样，都为0.678。这里也验证了树模型不需要进行特征的归一化或者说标准化，此外树模型对于共线性也不敏感。\n",
    "\n",
    "通过上面这个演示，也可以得出这么一个读者经常会问到的一个疑问：需不需要进行某种数据预处理？以后如果还有这样的疑问，那么不妨就做一下该数据预处理，如果发现最终结果没有区别，那就能够明白对于该模型不需要做相关数据预处理。\n",
    "当然绝大部分模型都无法自动完成的一步就是特征提取。很多自然语言处理的问题或者图象的问题，没有现成的特征，需要人工去提取这些特征。\n",
    "\n",
    "综上来说，XGBoost的确比线性模型要省去很多特征工程的步骤，但是特征工程依然是非常必要的，这一结论同样适用于下面即将讲到的LightGBM模型。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
